Analyzing The Combined Dataset

First let's import the necessary libraries.

Also import the visualization libraries.

Let's define a function so that we can easily load the datasets.
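A minimal sketch of such a loader, assuming the datasets are CSV files (the function name and file names below are hypothetical, not the notebook's actual ones):

```python
import pandas as pd

def load_dataset(path):
    """Read one survey CSV file into a DataFrame (path is a hypothetical example)."""
    return pd.read_csv(path)

# Hypothetical usage -- the real file names may differ:
# university_df = load_dataset("university.csv")
# college_df = load_dataset("college.csv")
# school_df = load_dataset("school.csv")
```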

Let's import the dataset.

Let's check the data.

Check the dataset using info().

Let's check the shape.

Combine The Datasets

We'll now combine the datasets. But first, let's add a column called Academic Institution designating which type of academic institution each student belongs to.

Start with the university_df.
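As a sketch of this step, assuming the frame is named university_df (the rows below are toy stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical stand-in rows; the real university_df is loaded from file.
university_df = pd.DataFrame({"Gender": ["Male", "Female", "Female"]})

# Tag every row with its institution type before the frames are concatenated.
university_df["Academic Institution"] = "University"
```

The same assignment, with "College" and "School" as the values, would be applied to the other two frames.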

Let's check the data.

Let's check the values in Academic Institution.

Now do the same operation to college_df.

Let's check the data.

Let's check the values in Academic Institution.

Now do the same operation to school_df.

Let's check the data.

Let's check the values in Academic Institution.

Finally, let's combine the datasets.
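A minimal sketch of the concatenation, using toy stand-ins for the three frames (already tagged with their institution type):

```python
import pandas as pd

# Toy stand-ins for the three tagged frames.
university_df = pd.DataFrame({"Academic Institution": ["University"] * 3})
college_df = pd.DataFrame({"Academic Institution": ["College"] * 2})
school_df = pd.DataFrame({"Academic Institution": ["School"] * 2})

# Stack the rows and rebuild a clean 0..n-1 index.
combined_df = pd.concat([university_df, college_df, school_df], ignore_index=True)
```

`ignore_index=True` matters here: without it the combined frame would carry three overlapping 0-based indexes.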

Let's check the data.

Let's check the shape of the new dataset.

Now let's check all the categorical attributes individually, starting with Gender.

Check Age

Check Academic Institution

Check Frequently Visited Website

Check Effectiveness Of Internet Usage

Check Devices Used For Internet Browsing

Check Location Of Internet Use

Check Household Internet Facilities

Check Time Of Internet Browsing

Check Frequency Of Internet Usage

Check Place Of Student's Residence

Check Purpose Of Internet Use

Check Browsing Purpose

Check Webinar

Check Priority Of Learning On The Internet

Check Internet Usage For Educational Purpose

Check Academic Performance

Check Barriers To Internet Access

Plot the data

Now we can plot the data. Let's write a couple of functions so that we can easily plot the data.

This function saves the figures.
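One possible shape for such a helper, assuming figures go into a local images directory (the directory name and function signature are assumptions, not the notebook's actual code):

```python
import os
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt

IMAGES_PATH = "images"  # assumed output directory

def save_fig(fig_id, fig_extension="png", resolution=300):
    """Save the current matplotlib figure under IMAGES_PATH and return its path."""
    os.makedirs(IMAGES_PATH, exist_ok=True)
    path = os.path.join(IMAGES_PATH, f"{fig_id}.{fig_extension}")
    plt.savefig(path, format=fig_extension, dpi=resolution, bbox_inches="tight")
    return path
```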

This function plots histogram and box plot of the given non-categorical data.

This function plots histograms of the given categorical data.

Let's define a function to create scatter plots of the numerical values and check the distribution of the attribute values against the target column, 'Academic Performance'.

Let's define a similar function to check the distribution against the target column, 'Academic Institution'.

A modification of the previous function to create scatter plots of numerical values vs. numerical values, checking the distribution of the attribute values against the target column, 'Academic Performance'.

The same modification, but against the target column, 'Academic Institution'.

This function plots histograms of the categorical values against the 'Academic Performance' column.

These are helper functions.

This is the main function.

The following function does the same thing with respect to 'Academic Institution'.

The following function does the same thing with respect to 'Browsing Purpose'

The following function does the same thing with respect to 'Academic Institution'

This function adds value counts on top of each bar in the histogram.
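A sketch of one way to implement this with matplotlib (the function name is hypothetical; it assumes the bars are already drawn on an Axes):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt

def annotate_counts(ax):
    """Write each bar's height (its value count) just above the bar."""
    for patch in ax.patches:
        height = patch.get_height()
        ax.annotate(f"{int(height)}",
                    (patch.get_x() + patch.get_width() / 2, height),
                    ha="center", va="bottom")
```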

Now let's start plotting the data.

Plotting Non-Categorical Values

'Total Internet Usage(hrs/day)', 'Time Spent in Academic(hrs/day)' and 'Duration Of Internet Usage(In Years)' are the only non-categorical attributes in the dataset.

Let's plot the bar plot for each of the non-categorical attributes together.

Plotting Total Internet Usage(hrs/day)

First let's check the histogram and the boxplot of this column.

Now let's check the scatter plot.

Now let's try plotting Total Internet Usage(hrs/day) against the target column 'Academic Performance'.

Now let's try plotting Total Internet Usage(hrs/day) against the target column 'Academic Institution'.

Plotting Time Spent in Academic(hrs/day)

First let's check the histogram and the boxplot of this column.

Now let's check the scatter plot.

Now let's try plotting Time Spent in Academic(hrs/day) against the target column 'Academic Performance'.

Now let's try plotting Time Spent in Academic(hrs/day) against the target column 'Academic Institution'.

Plotting Time Spent in Academic(hrs/day) vs Total Internet Usage(hrs/day)

Let's use scatter plot.

Now let's try plotting Time Spent in Academic(hrs/day) vs 'Total Internet Usage(hrs/day)' against the target 'Academic Performance'.

Now let's try plotting Time Spent in Academic(hrs/day) vs 'Total Internet Usage(hrs/day)' against the target 'Academic Institution'.

Plotting Duration Of Internet Usage(In Years)

First let's check the histogram and the boxplot of this column.

Now let's check the scatter plot.

Now let's try plotting 'Duration Of Internet Usage(In Years)' against the target column 'Academic Performance'.

Now let's try plotting Time Spent in Academic(hrs/day) vs 'Duration Of Internet Usage(In Years)' against the target 'Academic Performance'.

Now let's try plotting Time Spent in Academic(hrs/day) vs 'Duration Of Internet Usage(In Years)' against the target 'Academic Institution'.

Now let's try plotting 'Total Internet Usage(hrs/day)' vs 'Duration Of Internet Usage(In Years)' against the target 'Academic Performance'.

Now let's try plotting 'Total Internet Usage(hrs/day)' vs 'Duration Of Internet Usage(In Years)' against the target 'Academic Institution'.

Plotting Categorical Values

'Gender', 'Age', 'Academic Institution', 'Frequently Visited Website', 'Effectiveness Of Internet Usage', 'Devices Used For Internet Browsing', 'Location Of Internet Use', 'Household Internet Facilities', 'Time Of Internet Browsing', 'Frequency Of Internet Usage', 'Place Of Student's Residence', 'Purpose Of Internet Use', 'Browsing Purpose', 'Webinar', 'Priority Of Learning On The Internet', 'Academic Performance', 'Barriers To Internet Access' are the categorical values in the dataset.

Plotting 'Gender'

Let's check the histogram.

Plotting 'Age'

Let's check the histogram.

Plotting 'Frequently Visited Website'

Let's check the histogram.

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against 'Browsing Purpose'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Effectiveness Of Internet Usage'

Let's check the histogram.

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Devices Used For Internet Browsing'

Let's check the histogram.

Plotting 'Location Of Internet Use'

Let's check the histogram.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Household Internet Facilities'

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Time Of Internet Browsing'

Let's check the histogram.

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Frequency Of Internet Usage'

Let's check the histogram.

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Place Of Student's Residence'

Let's check the histogram.

Plotting 'Purpose Of Internet Use'

Let's check the histogram.

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Browsing Purpose'

Let's check the histogram.

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Webinar'

Let's check the histogram.

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Priority Of Learning On The Internet'

Let's check the histogram.

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Internet Usage For Educational Purpose'

Let's check the histogram.

Let's check the distribution of this feature against the target i.e. 'Academic Performance'.

Let's check the distribution of this feature against the target i.e. 'Browsing Purpose'.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Academic Performance'

Let's check the histogram.

Let's check the distribution of this feature against 'Academic Institution'.

Plotting 'Barriers To Internet Access'

Let's check the histogram.

Let's check the distribution of this feature against 'Academic Institution'.

Inspecting Age Closer

Let's define a function to make this process easier.

Now let's inspect the columns 'Total Internet Usage(hrs/day)', 'Duration Of Internet Usage(In Years)', 'Time Spent in Academic(hrs/day)' against the column 'Age' and also segment the distribution by the target 'Academic Performance'.

Multivariate Analysis

Multivariate analysis (MVA) is based on the principles of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time. Typically, MVA is used to address the situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structures are important.

Let's add hue="Academic Performance" to the pairplot.

Correlations

We are going to use Pearson correlation to find linear relations between the features; a heatmap is a decent way to show these relations.
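A minimal sketch of the correlation heatmap on toy numeric data standing in for the dataset's numeric columns (the random values are placeholders, not the real survey data):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Toy numeric frame standing in for the dataset's numeric columns.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Total Internet Usage(hrs/day)": rng.uniform(1, 10, 50),
    "Time Spent in Academic(hrs/day)": rng.uniform(1, 8, 50),
    "Duration Of Internet Usage(In Years)": rng.uniform(1, 12, 50),
})

# Pearson is pandas' default, but we name it explicitly for clarity.
corr = df.corr(method="pearson")
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```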

Building The Prediction Models

Let's drop the target column 'Academic Performance' from the main dataframe. Store the target column in a separate variable first.
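The step above can be sketched as follows, on a toy frame standing in for the combined dataset (the variable names X and y are assumptions):

```python
import pandas as pd

# Toy frame standing in for the combined dataset.
combined_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Academic Performance": ["Good", "Excellent", "Average"],
})

y = combined_df["Academic Performance"].copy()    # keep the target separately
X = combined_df.drop(columns=["Academic Performance"])
```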

Let's separate the numerical and categorical columns for preprocessing. Let's check which columns are numerical and which are categorical.

The columns 'Age', 'Total Internet Usage(hrs/day)', 'Time Spent in Academic(hrs/day)', 'Duration Of Internet Usage(In Years)' contain numerical values. Let's separate them from the main dataframe.

Store the numerical attributes in a separate variable.

Let's encode the categorical values in the dataset university_cat as integers. We'll use the LabelEncoder from sklearn.preprocessing.
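A sketch of this encoding on a toy categorical frame (LabelEncoder handles one column at a time, so it is applied per column; the column values below are stand-ins):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical frame standing in for university_cat.
university_cat = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Webinar": ["Yes", "No", "Yes"],
})

# LabelEncoder assigns integers by sorted order of the unique values.
encoded = university_cat.apply(lambda col: LabelEncoder().fit_transform(col))
```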

Let's normalize the dataset using sklearn's normalize function. However, the models seem to perform better without normalization.

Let's combine the preprocessed numerical and categorical part of the dataset.

Split the dataset for training and testing purposes. We'll use sklearn's train_test_split function to do this.
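A minimal sketch of the split on toy data (the 80/20 ratio and the stratify option are assumptions; the notebook's actual arguments may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the preprocessed dataset.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# 80/20 split; stratify keeps class proportions, random_state makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```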

Implementing Machine Learning Algorithms For Classification

Stochastic Gradient Descent

Let's start with the Stochastic Gradient Descent classifier. We'll use sklearn's SGDClassifier to do this. After training the classifier, we'll check the model's accuracy score.
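A sketch of the fit-and-score pattern on toy separable data (the data and hyperparameters are stand-ins for the notebook's actual train/test split):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Toy, well-separated two-class data standing in for the training split.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(5, 1, (40, 3))])
y_train = np.array([0] * 40 + [1] * 40)

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train)
train_acc = accuracy_score(y_train, sgd_clf.predict(X_train))
```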

Let's check the confusion matrix and classification report of this model.

Let's perform cross validation using this model. We'll use KFold for this purpose.
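A sketch of KFold cross validation with cross_val_score, on toy data (the fold count and shuffle setting are assumptions):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, cross_val_score

# Toy two-class data standing in for the full dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

# shuffle=True matters here: without it a fold could contain a single class.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(SGDClassifier(random_state=42), X, y,
                         cv=kfold, scoring="accuracy")
```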

Let's check the score.

Let's plot the training accuracy curve. But first, we'll train and evaluate the model with max_iter in the range (5, 300).

Let's check the scores variable.

Finally, let's plot the training score.

Testing score.

Let's plot the two scores together so that we can compare them.

Decision Tree

Next, let's try the Decision Tree classifier. We'll use sklearn's DecisionTreeClassifier to do this. After training the classifier, we'll check the model's accuracy score.

Let's check the confusion matrix and classification report of this model.

Let's perform cross validation using this model. We'll use KFold for this purpose.

Let's check the score.

Let's plot the training accuracy curve. But first, we'll train and evaluate the model with max_depth in the range (1, 27).

Let's check the scores variable.

Finally, let's plot the training and testing scores together so that we can compare the two.

Logistic Regression

Next, let's try the Logistic Regression classifier. We'll use sklearn's LogisticRegression to do this. After training the classifier, we'll check the model's accuracy score.

Let's check the confusion matrix and classification report of this model.

Let's perform cross validation using this model. We'll use KFold for this purpose.

Let's check the score.

Let's plot the training accuracy curve. But first, we'll train and evaluate the model with max_iter in the range (50, 200).

Let's check the scores variable.

Finally, let's plot the training and testing scores together so that we can compare the two.

Random Forest

Next, let's try the Random Forest classifier. We'll use sklearn's RandomForestClassifier to do this. After training the classifier, we'll check the model's accuracy score.

Let's check the confusion matrix and classification report of this model.

Let's perform cross validation using this model. We'll use KFold for this purpose.

Let's check the score.

Let's plot the training accuracy curve. But first, we'll train and evaluate the model with n_estimators in the range (1, 35).

Let's check the scores variable.

Finally, let's plot the training and testing scores together so that we can compare the two.

Naive Bayes

Next, let's try the Naive Bayes classifiers. We'll use sklearn's GaussianNB, MultinomialNB and CategoricalNB to do this. After training the classifiers, we'll check the model accuracy scores.

Both GaussianNB and MultinomialNB have the same training accuracy.

Let's check the confusion matrix and classification report of GaussianNB model.

Let's perform cross validation using this model. We'll use KFold for this purpose.

Let's check the confusion matrix and classification report of this model.

Check Feature Importance

Univariate Selection

Statistical tests can be used to select the features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. The code below uses the chi-squared (chi²) statistical test for non-negative features to select 10 of the best features from our dataset.
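A sketch of the SelectKBest + chi² pattern on toy non-negative data (the feature names and k=3 are placeholders; the notebook selects 10 features from the real, label-encoded dataset):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative integer features standing in for the label-encoded dataset.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.integers(0, 5, size=(100, 6)),
                 columns=[f"feat_{i}" for i in range(6)])
y = (X["feat_0"] >= 3).astype(int)  # the label depends only on feat_0

# chi2 requires non-negative features; rank them and keep the top k.
selector = SelectKBest(score_func=chi2, k=3)
selector.fit(X, y)
best = X.columns[selector.get_support()].tolist()
```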

Feature Importance

We can get the importance of each feature of our dataset by using the feature importance property of the model. Feature importance gives a score for each feature of the data: the higher the score, the more important or relevant the feature is to our output variable. Feature importance is an inbuilt property of tree-based classifiers; we will be using the Extra Trees classifier to extract the top 10 features of the dataset.
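A sketch of extracting importances with an Extra Trees model, on toy data where the label depends on a single feature (feature names and sizes are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Toy data: the label is determined entirely by the sign of feat_2.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=[f"feat_{i}" for i in range(5)])
y = (X["feat_2"] > 0).astype(int)

model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Importances sum to 1; sorting puts the most relevant feature first.
importances = pd.Series(model.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
```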

Let's plot the top 10 most important features.

Correlation Matrix with Heatmap

Correlation states how the features are related to each other or to the target variable. Correlation can be positive (an increase in one feature value increases the value of the target variable) or negative (an increase in one feature value decreases the value of the target variable). A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library.

Hyperparameter Optimization

Hyperparameter optimization, or tuning, is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

We'll perform hyperparameter optimization using the following optimization techniques:

  1. GridSearchCV - Exhaustive search over specified parameter values for an estimator.
  1. RandomizedSearchCV - Randomized search on hyper parameters. The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.
  1. BayesSearchCV - Bayesian Optimization of model hyperparameters provided by the Scikit-Optimize library.
  1. Genetic Algorithm using the TPOT library - TPOT is an open-source library for performing AutoML in Python. It makes use of the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms and uses a Genetic Programming stochastic global search procedure to efficiently discover a top-performing model pipeline for a given dataset.

Let's start with GridSearchCV.

Hyperparameter Optimization using GridSearchCV

As we saw, the algorithms that perform best are LogisticRegression and RandomForestClassifier. Let's try to optimize the RandomForestClassifier further to get a better result. First, let's see the parameters that we'll try to tune in the RandomForestClassifier.

Let's create a dictionary that defines the parameters that we want to optimize.

Now, let's optimize the model using GridSearchCV. The method we'll use for cross validation is RepeatedStratifiedKFold.
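A compact sketch of this pattern on generated data (the grid below is deliberately tiny so it runs quickly; the real grid, fold counts and data are the notebook's own):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Toy classification data standing in for the training split.
X, y = make_classification(n_samples=120, n_features=6, random_state=42)

# A deliberately small grid so the example runs quickly; the real grid is larger.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

# Repeated stratified folds keep class proportions in every split.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=42)
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=cv, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
```

After fitting, `search.best_params_` and `search.best_estimator_` hold the winning configuration and a refitted model.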

Let's check the training score. It should be performing much better now.

Let's put the model to use and predict our test set.

Hyperparameter Optimization using RandomizedSearchCV

As we saw, the algorithms that perform best are LogisticRegression and RandomForestClassifier. Let's try to optimize the RandomForestClassifier further to get a better result. First, let's see the parameters that we'll try to tune in the RandomForestClassifier.

We'll use the same dictionary that we created before as the parameters that we want to optimize. Now, let's optimize the model using RandomizedSearchCV. The method we'll use for cross validation is RepeatedStratifiedKFold.

Let's check the training score. It should be performing much better now.

Let's put the model to use and predict our test set.

Hyperparameter Optimization using BayesSearchCV

As we saw, the algorithms that perform best are LogisticRegression and RandomForestClassifier. Let's try to optimize the RandomForestClassifier further to get a better result. First, let's see the parameters that we'll try to tune in the RandomForestClassifier.

We'll use the same dictionary that we created before for the parameters that we want to optimize. Now, let's optimize the model using the Bayesian optimization implemented in BayesSearchCV; this class is provided by the skopt library. The method we'll use for cross validation is RepeatedStratifiedKFold.

Let's check the training score. It should be performing much better now.

Let's put the model to use and predict our test set.

Hyperparameter Optimization using Genetic Algorithm

Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, provided with historical data, to direct the search into the region of better performance in the solution space. They are commonly used to generate high-quality solutions for optimization and search problems.

Genetic algorithms simulate the process of natural selection, which means that the species that can adapt to changes in their environment are able to survive, reproduce and pass on to the next generation. In simple words, they simulate "survival of the fittest" among individuals of consecutive generations to solve a problem. Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters/integers/floats/bits. This string is analogous to the chromosome.

To implement genetic algorithm we'll use TPOT which is an open-source library for performing AutoML in Python. It makes use of the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms and uses a Genetic Programming stochastic global search procedure to efficiently discover a top-performing model pipeline for a given dataset.

We'll first have to encode the training and test label sets as integers. Here we use sklearn's LabelEncoder class to do this.

Here we see our labels are encoded according to the following:

  - Excellent - 1
  - Good - 2
  - Average - 0
  - Not Satisfactory - 3

Let's finally train the Genetic Algorithm using TPOTClassifier. We are currently using 15 generations, 100 population_size and 150 offspring_size.

The genetic algorithm showed us that the most optimized algorithm is the KNeighborsClassifier with the following parameters:

KNeighborsClassifier(CombineDFs(RandomForestClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.3, min_samples_leaf=1, min_samples_split=7, n_estimators=100), input_matrix), n_neighbors=3, p=1, weights=distance), with a score of 0.7.

Let's fit this algorithm to our dataset and check the training score.

Let's check the accuracy on the test set and check the confusion matrix, precision, recall and f1 scores.

Finally, let's perform KFold cross validation.

This model gives us 61% accuracy on KFold cross validation.